ABSTRACT

Role Based Access Control (RBAC) is a very popular access control
model that has long been investigated and is widely deployed in the
security architecture of many enterprises. To implement RBAC, roles
must first be identified within the organization under consideration.
The process of (automatically) defining roles in a bottom-up way,
starting from the permissions assigned to each user, is usually called
role mining. In the literature, the role mining problem has been
formally analyzed and several techniques have been proposed to obtain
a set of valid roles.

Recently, the problem of defining different kinds of constraints on
the number and the size of the roles included in the resulting role
set has been addressed. In this paper we provide a formal definition
of the role mining problem under a cardinality constraint, i.e., a
restriction on the maximum number of permissions that can be included
in a role. We formally discuss the computational complexity of the
problem and propose a novel heuristic. Furthermore, we present
experimental results obtained by applying the proposed heuristic to
both real and synthetic datasets, and compare the resulting
performance with previous proposals.

1. INTRODUCTION 

Complex organizations need to establish access control policies in
order to manage access to restricted resources. A simple way to
accomplish this task is to collect sets of permissions into roles and
then assign roles according to the responsibilities and qualifications
of each employee. Role Based Access Control (RBAC) is a well-known
paradigm for defining and organizing roles and permissions in an
efficient way. Introduced in the early 1990s [2] FT], this paradigm
has long been investigated and has recently come into use in different
commercial systems to manage identities and accounts [2]. The goal of
RBAC is to collect sets of permissions into roles and to define a
complete and efficient set of roles that can be assigned to users in
order to access restricted resources. The advantage is that access
control can be centralized and decoupled from users, and the cost and
overhead of security management can be reduced.

The correct definition of a set of roles that satisfies the needs of
the organization is the most difficult and costly task in the
implementation of an RBAC system. This activity is often referred to
as role engineering and includes the correct identification of roles
from the current organizational structure of the enterprise. This
task, i.e., the extraction of a complete and efficient set of roles,
can be performed mainly using two approaches: top-down or bottom-up
role engineering. In the first case, roles are defined after the
functions of the organization have been carefully defined and studied
and elicitation activities have been performed. The top-down approach
is usually labor-intensive and requires a large amount of human work
and time, especially in large enterprises with a large number of
business processes, as reported in some case studies available in the
literature [18]. The bottom-up process, often also called role mining,
starts instead from the analysis of the current set of permissions
assigned to users and tries to aggregate them in order to extract and
define roles. Hybrid approaches are also possible, in which both
directions, top-down and bottom-up, are used in subsequent steps of
the analysis in order to refine the returned set of roles.

The bottom-up approach to role mining has been investigated more
extensively, since many techniques borrowed from data mining can be
applied automatically to the existing configuration of user-permission
assignments. An RBAC system can be easily constructed in this way and
a starting set of roles can be generated quickly. The problem with
this approach is that the quality and the number of returned roles are
often unsatisfactory, since no semantics is taken into consideration
when the role mining process is started. In many situations the
returned set of roles might not match any functional organization
within the analyzed enterprise, and the existing business processes
might not be adequately represented. An accurate analysis of the
returned results is needed to better tune the retrieved representation
of roles to the current organizational requirements of the enterprise.
A formal definition of the Role Mining Problem (RMP) and of some of
its variants has been given and analyzed in depth in [23]. There, the
NP-completeness of the (decisional) RMP has been proved. The
formulation of RMP as a graph covering problem has been given in
[6], [26].



The problem of imposing constraints on the set of roles resulting
from the role mining process has been considered in the past.
Statically or dynamically mutually exclusive role constraints have
been included in RBAC models [19] and in the ANSI/NIST standards [7].
According to these constraints, for example, no user can be assigned
a given set of mutually exclusive roles at the same time, or no user
can activate a given set of roles simultaneously in the same session.
Such constraints are often used as mechanisms to achieve separation of
duty principles, preventing one user from having a large number of
roles to complete a task, or imposing restrictions on the minimum
number of users needed to perform a critical activity [14].

Recently, a simple classification of the constraints on the number of
roles, users, and permissions for a given RBAC system has been
proposed [15]. The first heuristic taking into account a cardinality
constraint on the number of permissions contained in a role was
proposed in [13]. In that work, however, the proposed heuristic was
compared only with other heuristics that were not able to handle
constrained roles. In this work we propose a novel heuristic for
mining RBAC roles under a cardinality constraint. The algorithm is
based on a previous proposal [1], where an initial candidate role set
was constructed by considering one role for each user on the basis of
the current assignment of permissions. The role set is then refined
and updated by eliminating the roles obtained as the union of other
roles already included in the set and by ensuring that the cardinality
constraint is respected. Finally, the role set is optimized by running
a lattice reduction procedure previously described in [6]. The
resulting procedure is very efficient in terms of computation time and
quality of the returned role set. To support this claim, we present
the results obtained by running our heuristics on different datasets,
some available online and some artificially created. The results are
compared with our implementation of the algorithm presented in [13]
and analyzed in terms of the metrics presented in [17].



The remainder of this paper is organized as follows. Section 2
contains the preliminary concepts needed to define the constrained
role mining problem and the discussion of its complexity. Section 3
discusses related work. In Section 4 we introduce our heuristics, and
in Section 5 we compare them with related work. Finally, Section 6
presents our conclusions and ongoing work.

2. CONSTRAINED RMP 

In this section we introduce the Constrained Role Min- 
ing Problem and analyze its computational complexity 
showing that it is NP-complete while its optimization ver- 
sion is NP-hard. 

2.1 Basic Definitions 

The notation we use is borrowed from the NIST standard for Core
Role-Based Access Control (Core RBAC) [7] and is adapted to our needs.

We denote by $\mathcal{U} = \{u_1, \ldots, u_n\}$ the set of users, by
$\mathcal{P} = \{p_1, \ldots, p_m\}$ the set of permissions, and by
$\mathcal{R} = \{r_1, \ldots, r_s\}$ the set of roles, where, for any
$r \in \mathcal{R}$, we have $r \subseteq \mathcal{P}$. We also define
the following relations. $\mathcal{URA} \subseteq \mathcal{U} \times
\mathcal{R}$ is a many-to-many user-to-role assignment relation.
$\mathcal{RPA} \subseteq \mathcal{R} \times \mathcal{P}$ is a
many-to-many role-to-permission assignment relation. $\mathcal{UPA}
\subseteq \mathcal{U} \times \mathcal{P}$ is a many-to-many
user-to-permission assignment relation. $\mathcal{RH} \subseteq
\mathcal{R} \times \mathcal{R}$ is a partial order over $\mathcal{R}$,
which is called a role hierarchy. When $(r_1, r_2) \in \mathcal{RH}$,
we say that the role $r_1$ is senior to $r_2$.

When needed, we will represent the assignment relations by binary
matrices. For instance, by UPA we denote the matrix representation of
$\mathcal{UPA}$. The binary matrix UPA satisfies $\mathtt{UPA}[i][j] = 1$
if and only if $(u_i, p_j) \in \mathcal{UPA}$. This means that user
$u_i$ possesses permission $p_j$. In a similar way, we define the
matrices URA, RPA, and RH.

Given the $n \times m$ users-to-permissions assignment matrix UPA, the
role mining problem (see [23], rcj], and p|) consists in finding an
$n \times k$ binary matrix URA and a $k \times m$ binary matrix RPA
such that $\mathtt{UPA} = \mathtt{URA} \otimes \mathtt{RPA}$, where the
operator $\otimes$ is such that, for $1 \le i \le n$ and $1 \le j \le m$,

$$\mathtt{UPA}[i][j] = \bigvee_{h=1}^{k} \left( \mathtt{URA}[i][h] \wedge \mathtt{RPA}[h][j] \right).$$

Therefore, in solving a role mining problem, we are looking for a
factorization of the matrix UPA. The smallest value $k$ for which UPA
can be factorized as $\mathtt{URA} \otimes \mathtt{RPA}$ is referred to
as the binary rank of UPA. A candidate role consists of a set of
permissions along with a user-to-role assignment. The union of the
candidate roles is referred to as the candidate role-set. A candidate
role-set is called complete if the permissions described by any row of
UPA can be exactly covered by the union of some candidate roles. In
other words, a candidate role-set is complete if and only if it is a
solution of the equation $\mathtt{UPA} = \mathtt{URA} \otimes
\mathtt{RPA}$. In this paper we consider only complete candidate
role-sets.
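The factorization condition can be made concrete with a short Python sketch. It computes the Boolean product of two 0/1 matrices and checks whether a candidate pair (URA, RPA) reproduces UPA. The function names and the small 3 x 3 example are illustrative assumptions and do not come from the paper; the sketch also assumes at least one role (k >= 1).

```python
from typing import List

Matrix = List[List[int]]


def bool_product(ura: Matrix, rpa: Matrix) -> Matrix:
    """Boolean product: out[i][j] = OR over h of (ura[i][h] AND rpa[h][j])."""
    k = len(rpa)          # number of roles (assumed >= 1)
    m = len(rpa[0])       # number of permissions
    return [
        [int(any(row[h] and rpa[h][j] for h in range(k))) for j in range(m)]
        for row in ura
    ]


def is_factorization(upa: Matrix, ura: Matrix, rpa: Matrix) -> bool:
    """True when UPA equals the Boolean product of URA and RPA."""
    return bool_product(ura, rpa) == upa


# Hypothetical example: 3 users, 3 permissions, 2 roles.
UPA = [[1, 1, 0],
       [0, 1, 1],
       [1, 1, 1]]
RPA = [[1, 1, 0],   # role 0 = {p1, p2}
       [0, 1, 1]]   # role 1 = {p2, p3}
URA = [[1, 0],      # user 1 holds role 0
       [0, 1],      # user 2 holds role 1
       [1, 1]]      # user 3 holds both roles

print(is_factorization(UPA, URA, RPA))
```

In this example two roles suffice, and a single role cannot work because the three rows of UPA are pairwise different, so the binary rank of this UPA is 2.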

By adding a constraint $t$ on the number of permissions that can be
assigned to any role, the $t$-constrained role mining problem can be
defined as follows. Given an $n \times m$ users-to-permissions
assignment matrix UPA and a positive integer $t > 1$, find an
$n \times k$ binary matrix URA and a $k \times m$ binary matrix RPA
such that $\mathtt{UPA} = \mathtt{URA} \otimes \mathtt{RPA}$ and, for
any $1 \le i \le k$, one has $|\{j : \mathtt{RPA}[i][j] = 1\}| \le t$.
The computational complexity of the problem defined above is discussed
in the next section.
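A candidate solution of the $t$-constrained problem can be verified in polynomial time, which is also what membership in NP relies on. The following Python sketch shows one way to perform the check; the function name and the example matrices are hypothetical.

```python
def is_t_constrained_solution(upa, ura, rpa, t):
    """Check UPA = URA (x) RPA and that every role has at most t permissions."""
    # Cardinality constraint: each row of RPA has at most t ones.
    if any(sum(role) > t for role in rpa):
        return False
    k = len(rpa)
    # Exact cover: user i has permission j iff some assigned role grants it.
    for i, row in enumerate(upa):
        for j, bit in enumerate(row):
            granted = any(ura[i][h] and rpa[h][j] for h in range(k))
            if int(granted) != bit:
                return False
    return True


# Hypothetical example: the same UPA factorized with roles of size 2 and of size 1.
UPA = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
RPA_2 = [[1, 1, 0], [0, 1, 1]]
URA_2 = [[1, 0], [0, 1], [1, 1]]
RPA_1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # one role per permission
URA_1 = UPA                                  # each user holds its own permissions

print(is_t_constrained_solution(UPA, URA_2, RPA_2, t=2))
print(is_t_constrained_solution(UPA, URA_2, RPA_2, t=1))
print(is_t_constrained_solution(UPA, URA_1, RPA_1, t=1))
```

The example also shows the trade-off behind the constraint: tightening $t$ from 2 to 1 invalidates the two-role solution and forces three single-permission roles.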

2.2 NP-Completeness

The computational complexity of the Role Mining Problem (and of some
of its variants) was considered in several papers (see, for instance,
[23], [5], p|, and [24]). In this section we define the decisional
version of the $t$-Constrained Role Mining Problem and we show that it
is NP-complete (its optimization version is NP-hard). We first recall
the decisional version of the Role Mining Problem.



Problem 2.1. (Role Mining Decision Problem) Given a set of users
$\mathcal{U}$, a set of permissions $\mathcal{P}$, a user-permission
assignment $\mathcal{UPA}$, and a positive integer
$k < \min\{|\mathcal{U}|, |\mathcal{P}|\}$, are there a set of roles
$\mathcal{R}$, a user-to-role assignment $\mathcal{URA}$, and a
role-to-permission assignment $\mathcal{RPA}$ such that
$|\mathcal{R}| \le k$ and $\mathtt{UPA} = \mathtt{URA} \otimes
\mathtt{RPA}$?



In [23] it was shown that Problem 2.1 is NP-complete. This was proved
by a reduction from the Set Basis Decision Problem (problem SP7 in
Garey and Johnson's book [9]), which was shown to be NP-complete by
Stockmeyer in [22].

The decisional version of the $t$-constrained role mining problem can
be defined as follows.



Problem 2.2. ($t$-Constrained Role Mining Decision Problem) Given a
set of users $\mathcal{U}$, a set of permissions $\mathcal{P}$, a
user-permission assignment $\mathcal{UPA}$, and two positive integers
$t$ and $k$, with $t > 1$ and $k < \min\{|\mathcal{U}|, |\mathcal{P}|\}$,
are there a set of roles $\mathcal{R}$, a user-to-role assignment
$\mathcal{URA}$, and a role-to-permission assignment $\mathcal{RPA}$
such that $|\mathcal{R}| \le k$, $\mathtt{UPA} = \mathtt{URA} \otimes
\mathtt{RPA}$, and, for any $r \in \mathcal{R}$, $|r| \le t$?



To prove that the problem defined above is NP-complete we have to show
that it is in NP, that another NP-complete problem, say $\Pi$, can be
reduced to it (i.e., any instance of the problem $\Pi$ can be
transformed into an instance of the $t$-Constrained Role Mining
Decision Problem), and that the reduction can be done in polynomial
time. The problem $\Pi$ used in our simple proof is the Role Mining
Decision Problem (i.e., Problem 2.1).

Theorem 2.1. The $t$-Constrained Role Mining Decision Problem is
NP-complete.

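Proof. The problem is in NP. Indeed, the set $\mathcal{R}$ and the
matrices URA and RPA constitute a certificate (witness) verifiable in
polynomial time.

Assume we are given an instance of the Role Mining Decision Problem
consisting of a set of users $\mathcal{U}'$, a set of permissions
$\mathcal{P}'$, a user-permission assignment $\mathcal{UPA}'$, and a
positive integer $k' < \min\{|\mathcal{U}'|, |\mathcal{P}'|\}$. We show
how to transform it into an instance of the $t$-Constrained Role
Mining Decision Problem. The reduction is trivial. Indeed, it is
enough to set $\mathcal{U} = \mathcal{U}'$, $\mathcal{P} =
\mathcal{P}'$, $\mathcal{UPA} = \mathcal{UPA}'$, and $k = k'$, and
define

$$t = \max_{1 \le i \le n} |\{p_j \in \mathcal{P} : \mathtt{UPA}[i][j] = 1\}|.$$

It is immediate to see that the above reduction can be done in
polynomial time and that a solution to the $t$-Constrained Role Mining
Decision Problem directly provides a solution to the Role Mining
Decision Problem. Conversely, in any solution to the Role Mining
Decision Problem instance, every role assigned to at least one user is
contained in that user's permission set and therefore has at most $t$
permissions, while roles assigned to no user can be discarded; hence
it is also a solution to the constrained instance. Thus, the theorem
holds. □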
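The reduction used in the proof amounts to a single pass over UPA. A minimal Python sketch follows, with hypothetical function names: the instance is copied unchanged and $t$ is set to the largest number of permissions held by any user, so the constraint never excludes a role that some user could actually be assigned.

```python
def max_row_weight(upa):
    """Largest number of permissions assigned to a single user (0 if UPA is empty)."""
    return max((sum(row) for row in upa), default=0)


def reduce_to_constrained(upa, k):
    """Map an RMP decision instance (UPA, k) to a t-constrained instance (UPA, k, t)."""
    return upa, k, max_row_weight(upa)


# Hypothetical instance: the densest user holds 3 permissions, so t = 3.
UPA = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
print(reduce_to_constrained(UPA, 2))
```

The cost is linear in the size of UPA, which is why the reduction is polynomial.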



Next we define the optimization version of the $t$-Constrained Role
Mining Problem and we show that it is NP-hard.



Problem 2.3. ($t$-Constrained Role Mining Optimization Problem) Given
a set of users $\mathcal{U}$, a set of permissions $\mathcal{P}$, a
user-permission assignment $\mathcal{UPA}$, and a positive integer
$t$, what is the smallest integer $k$ for which there is a set of
roles $\mathcal{R}$, a user-to-role assignment $\mathcal{URA}$, and a
role-to-permission assignment $\mathcal{RPA}$ such that
$|\mathcal{R}| = k$, $\mathtt{UPA} = \mathtt{URA} \otimes \mathtt{RPA}$,
and, for any $r \in \mathcal{R}$, $|r| \le t$?



Theorem 2.2. The $t$-Constrained Role Mining Optimization Problem is
NP-hard.



Proof. The $t$-Constrained Role Mining Optimization Problem is NP-hard
because there exists a trivial polynomial-time reduction from the
$t$-Constrained Role Mining Decision Problem to it. Indeed, we can use
an algorithm solving the optimization problem as an oracle to solve
the decision problem, simply by checking whether the solution of the
associated optimization problem has cardinality less than or equal to
$k$. □
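The oracle argument can be illustrated on toy instances. The sketch below is not one of the heuristics of this paper: it is an exhaustive search, exponential in the number of permissions, written only to show how an optimizer answers the decision question. All names are hypothetical. It relies on the observation that a role is useful only if it is contained in some user's permission set.

```python
from itertools import combinations


def candidate_roles(rows, t):
    """All non-empty permission sets of size <= t contained in some user's row."""
    cands = set()
    for row in rows:
        for size in range(1, min(t, len(row)) + 1):
            cands.update(frozenset(c) for c in combinations(sorted(row), size))
    return sorted(cands, key=lambda r: (len(r), sorted(r)))


def covers(rows, roles):
    """True if every row equals the union of the roles it contains."""
    for row in rows:
        union = set()
        for role in roles:
            if role <= row:
                union |= role
        if union != row:
            return False
    return True


def min_roles(upa, t):
    """Optimization oracle (brute force): smallest k admitting a t-constrained solution."""
    rows = [frozenset(j for j, bit in enumerate(r) if bit) for r in upa]
    cands = candidate_roles(rows, t)
    for k in range(len(cands) + 1):
        if any(covers(rows, combo) for combo in combinations(cands, k)):
            return k
    raise ValueError("no solution: t must be at least 1")


def decide(upa, t, k):
    """Decision problem answered through the optimization oracle."""
    return min_roles(upa, t) <= k


UPA = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
print(min_roles(UPA, 2), decide(UPA, 2, 2), decide(UPA, 2, 1))
```

The search is feasible only for a handful of permissions, which is consistent with the hardness results above and motivates the heuristics of Section 4.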

In [22], Stockmeyer proved that the Set Basis Decision Problem is
NP-complete by reducing to it the Vertex Cover Decision Problem (one
of Karp's 21 NP-complete problems [11], see also problem GT1 in [9]).
The Vertex Cover Optimization Problem is APX-complete IB], that is, it
cannot be approximated within any constant factor in polynomial time
unless P = NP. Therefore, we have the following simple
non-approximability result:

Theorem 2.3. The $t$-Constrained Role Mining Optimization Problem
cannot be approximated within any constant factor in polynomial time
unless P = NP.

3. RELATED WORKS 

Role engineering was first introduced by Coyne et al. in 12], where
the definition of a top-down process for the definition of roles was
discussed. Along the same research line, several other works have been
presented [18], but recently the focus of role engineering has turned
to more automated techniques, based on the bottom-up approach, where
data mining techniques are applied for the definition of roles [12].
Role mining algorithms have been presented based on set covering [5],
graph theory [6], [26], subset enumeration [25], and database tiling
[23]. The theoretical aspects of the RMP have been considered in [24],
[23], [3], where the complexity of the problem has been analyzed and
its equivalence to other known optimization problems has been shown.
Another related problem, i.e., dealing with the semantic meaning of
roles, has been addressed in [16].

Cardinality constraints on the number of permissions included in a
role were first considered in [13], where a heuristic algorithm called
Constrained Role Miner (CRM) was proposed. The CRM algorithm takes as
input the UPA matrix and returns a set of roles, each one satisfying
the given cardinality constraint. CRM is based on the idea of
clustering users having the same set of permissions and selecting, as
candidate roles, the roles composed of the set of permissions
satisfying the constraint and having the highest number of associated
users. In [13], the performance of the algorithm is evaluated on
real-world datasets considering different metrics (such as the number
of returned roles, the sum of the sizes of the user assignment and
permission assignment matrices, and so on), with respect to other
previously proposed algorithms. However, the comparison is performed
without considering constraints, since the other algorithms return a
complete set of roles but are not able to manage constraints. In
Section 5 we evaluate our proposal against the results obtained with
our implementation of the CRM algorithm, considering both real-world
and synthetic datasets. A different kind of cardinality constraint, on
the number of user-role assignments, has been considered in [10]. Such
constraints can be useful when the number of persons that can be
assigned to a given role (e.g., the number of directors, managers,
etc.) in a given organization is known or can be fixed. In that paper,
three algorithms based on a graph approach are proposed, where the
role mining problem is mapped to the problem of finding a minimum
biclique cover of the edges of a bipartite graph. The three heuristics
are obtained by modifying the basic greedy algorithm proposed in p\,
and experimental results on real-world datasets are reported
considering some constraints on the number of users that can be
assigned to any role. Finally, cardinality constraints have also been
considered in [15], where a representation of the constraints in terms
of association rules is proposed: permissions are regarded as
attributes, and each user-permission assignment as a transaction. To
generate mutually exclusive permission constraints, i.e., sets of
permissions that cannot be assigned to the same role, an algorithm
based on known techniques for mining association rules in databases is
proposed, and its performance is evaluated on synthetically generated
datasets.

4. HEURISTICS 

In this section we present a family of heuristics. Each heuristic
takes as input the matrix UPA and returns a complete role set
satisfying the cardinality constraint (i.e., at most $t$ permissions
are associated with each role). We borrow the ideas from the
heuristics presented in [1] and adapt them to handle the cardinality
constraint.

The basic idea is to select from UPA all rows having at most $t$
permissions, in an order that will be defined below. Such rows
correspond to candidate roles that are added to the candidate
role-set. If there is no row having at most $t$ permissions, then a
row is selected and $t$ of the permissions included in the row are
chosen (the way we select such a row and its permissions gives rise to
different heuristics). The selected permissions induce a role that is
added to the candidate role-set. Then, all rows covered by the
candidate role-set are removed from UPA, and the procedure is iterated
as long as the UPA matrix still contains rows.

The procedure sketched above is described more formally by Algorithm
1, where we use the following notation. Given an $a \times b$ binary
matrix $M$, for $1 \le i \le a$, by $M[i]$ we denote the $i$-th row of
$M$, while by $|M[i]|$ we denote the number of ones appearing in
$M[i]$. The procedures numCols(M) and numRows(M) return the number of
columns and rows, respectively, of the matrix $M$. For a set $S$ and
an integer $h$, the procedure first(S, h) returns the first $h$
elements listed in the set $S$. Given a user-permission assignment
matrix UPA, a new candidate role is generated by selecting a row of
UPA having the least number of permissions, with ties broken at random
(Lines 6-8 of Algorithm 1). If the number of permissions associated
with the selected row (i.e., the number of ones in UPA[selectedRow])
is at most $t$, then a new role is created (Line 9 of Algorithm 1).
The new role, containing all permissions associated with the selected
row, is then added to the candidate role-set (Line 21 of Algorithm 1).
In this algorithm, the matrix uncoveredP represents the users'
permissions that are not covered by the roles in candidateRoles. Once
we discover a role (i.e., newRole) to be added to the candidateRoles
set, we run setToZero (see Algorithm 2) to update the matrix
uncoveredP according to newRole.



All rows whose permissions are covered by the candidate roles are
removed from both matrices uncoveredP and UPA (Lines 22-23 of
Algorithm 1). The pseudo-code of removeCoveredUsers is quite similar
to the pseudo-code of setToZero, hence we omit it.¹ Algorithm 1 halts
when all rows of UPA have been removed (Line 5 of Algorithm 1). If the
number of permissions of the selected row exceeds the cardinality
constraint, then two possible ways of selecting the role to be added
to the candidate role-set have been considered. These two
possibilities give rise to two heuristics referred to as t-SMA_R-0 and
t-SMA_R-1, respectively. In t-SMA_R-0 (i.e., when selection is set to
0 in Algorithm 1), the new role simply contains the first $t$
permissions associated with the selected row (Lines 10-11 of Algorithm
1). In t-SMA_R-1 (i.e., when selection is set to 1 in Algorithm 1), we
instead select a row of the matrix uncoveredP having the least number
of permissions, with ties broken at random (Lines 13-16 of Algorithm
1). In other words, we select a row (i.e., a user) having the least
number of permissions still uncovered. If the selected row is
associated with more than $t$ permissions, then the new role only
includes its first $t$ permissions (Lines 17-19 of Algorithm 1).

Algorithm 2 setToZero(UPA, uncoveredP, newRole)

 1: np ← numCols(UPA)
 2: nr ← numRows(UPA)
 3: for i = 1 to nr do
 4:     permissions ← {p_j : 1 ≤ j ≤ np and UPA[i][j] = 1}
 5:     if newRole ⊆ permissions then
 6:         for all j such that p_j ∈ newRole do
 7:             uncoveredP[i][j] ← 0
 8:         end for
 9:     end if
10: end for
11: Remove from uncoveredP all-zeroes rows
12: return uncoveredP
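The row-based heuristic can be prototyped in a few lines of Python. The sketch below follows the structure of Algorithm 1 but is not the authors' Scilab implementation. It makes two assumptions that should be read as ours: roles are represented as sets of column indices, and, when a row exceeds $t$, the new role is taken from the first $t$ permissions of that row that are still uncovered. The second choice guarantees that every iteration covers at least one new permission, so the loop terminates. The helper exactly_covers is also hypothetical and only checks the result.

```python
import random


def t_sma_r(upa, t, selection=0, seed=0):
    """Row-based greedy sketch; returns a list of roles (frozensets of column indices)."""
    if t < 1:
        raise ValueError("t must be at least 1")
    rng = random.Random(seed)
    full = {i: frozenset(j for j, bit in enumerate(row) if bit)
            for i, row in enumerate(upa)}
    # Permissions still to be covered; users without permissions need no role.
    uncovered = {i: set(perms) for i, perms in full.items() if perms}
    roles = []
    while uncovered:
        # Lines 6-8: a remaining row with the fewest permissions, ties at random.
        fewest = min(len(full[i]) for i in uncovered)
        row = rng.choice(sorted(i for i in uncovered if len(full[i]) == fewest))
        new_role = set(full[row])
        if len(new_role) > t:
            if selection == 1:
                # Lines 13-16: switch to a row with the fewest uncovered permissions.
                fewest = min(len(p) for p in uncovered.values())
                row = rng.choice(sorted(i for i, p in uncovered.items()
                                        if len(p) == fewest))
            # Assumption: truncate the still-uncovered permissions of the row.
            new_role = set(sorted(uncovered[row])[:t])
        roles.append(frozenset(new_role))
        # setToZero and removeCoveredUsers, merged into one pass.
        for i in list(uncovered):
            if new_role <= full[i]:
                uncovered[i] -= new_role
                if not uncovered[i]:
                    del uncovered[i]
    return roles


def exactly_covers(upa, roles):
    """True if each row of UPA equals the union of the roles contained in it."""
    for row in upa:
        perms = {j for j, bit in enumerate(row) if bit}
        union = set()
        for role in roles:
            if role <= perms:
                union |= role
        if union != perms:
            return False
    return True


UPA = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
roles = t_sma_r(UPA, t=2)
print(len(roles), exactly_covers(UPA, roles))
```

Because ties are broken at random, the order of the returned roles can depend on the seed; on this small example the resulting role set does not.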

Algorithm 1 returns a set of roles (i.e., rows and subsets of rows)
exactly covering the UPA matrix. As described in [1], instead of
covering the matrix UPA using its rows we could use its columns. We
refer to this new heuristic based on columns as t-SMA_C. The only
difference between heuristics t-SMA_R and t-SMA_C is the way a role is
computed. Heuristic t-SMA_C selects a permission $p$ (i.e., a column
of UPA) associated with the least number of users. Setting
$U(p) = \{u \in \mathcal{U} : (u, p) \in \mathcal{UPA}\}$ (i.e., all
users having permission $p$), the role $r_p$ induced by permission $p$
is defined as $r_p = \{p' : U(p) \subseteq U(p')\} \cup \{p\}$. If
$|r_p| \le t$, then we add it to the candidate role-set; otherwise, we
add to the candidate role-set a role comprising the "first" $t$
permissions in $r_p$. As for heuristic t-SMA_R, rows covered by roles
in the candidate role-set are removed. We iterate this process as long
as the UPA matrix still contains rows. The pseudo-code of t-SMA_C is
quite similar to the one for t-SMA_R, hence we omit it.
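The role induced by a column can be computed directly from the definition of $r_p$. The Python sketch below performs a single selection step of the column-based variant. The names are hypothetical, and two details are our assumptions because the text leaves them open: ties between permissions are broken by lowest column index, and the truncation to $t$ permissions keeps $p$ itself first. The matrix is assumed to contain at least one 1.

```python
def users_of(upa, p):
    """U(p): indices of the users holding permission p."""
    return {i for i, row in enumerate(upa) if row[p]}


def induced_role(upa, t):
    """Pick a permission held by the fewest users and return (p, role of size <= t)."""
    m = len(upa[0])
    owners = {p: users_of(upa, p) for p in range(m)}
    used = [p for p in range(m) if owners[p]]          # ignore unused permissions
    # Assumption: ties are broken by lowest column index.
    p = min(used, key=lambda q: (len(owners[q]), q))
    # r_p = {p' : U(p) is a subset of U(p')}, listed with p first.
    role = [p] + [q for q in used if q != p and owners[p] <= owners[q]]
    return p, role[:t]


UPA = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
print(induced_role(UPA, t=2))
print(induced_role(UPA, t=1))
```

Every user holding $p$ also holds all permissions in $r_p$, so the induced role can be assigned to each of them without granting anything new.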

5. EXPERIMENTAL RESULTS 

In this section we evaluate the proposed heuristic by pre- 
senting some experimental results obtained executing our 



¹ Actually, in our implementation setToZero updates both uncoveredP
and UPA, but in Algorithm 1, for the sake of clarity, we prefer to
keep the updating of these matrices separate.



Algorithm 1 t-SMA_R(UPA, t, selection)

 1: candidateRoles ← ∅
 2: uncoveredP ← UPA
 3: np ← numCols(UPA)    /* np is equal to the number of permissions */
 4: nr ← numRows(UPA)    /* nr is equal to the number of users */
 5: while numRows(UPA) > 0 do
 6:     m ← min{|UPA[i]| : 1 ≤ i ≤ nr}
 7:     candidateRows ← {i : 1 ≤ i ≤ nr and |UPA[i]| = m}
 8:     selectedRow ←_R candidateRows
 9:     newRole ← {p_j : 1 ≤ j ≤ np and UPA[selectedRow][j] = 1}
10:     if |newRole| > t and selection == 0 then
11:         newRole ← first(newRole, t)
12:     else if |newRole| > t and selection == 1 then
13:         m ← min{|uncoveredP[i]| : 1 ≤ i ≤ nr}
14:         candidateRows ← {i : 1 ≤ i ≤ nr and |uncoveredP[i]| = m}
15:         selectedRow ←_R candidateRows
16:         newRole ← {p_j : 1 ≤ j ≤ np and uncoveredP[selectedRow][j] = 1}
17:         if |newRole| > t then
18:             newRole ← first(newRole, t)
19:         end if
20:     end if
21:     candidateRoles ← candidateRoles ∪ {newRole}
22:     uncoveredP ← setToZero(UPA, uncoveredP, newRole)
23:     UPA ← removeCoveredUsers(UPA, candidateRoles)
24: end while
25: return candidateRoles



Dataset          #Users   #Perms     |UPA|   min#Perms   max#Perms   Density
Healthcare           46       46     1,486           7          46    70.23%
Domino               79      231       730           1         209     4.00%
Emea                 35    3,046     7,220           9         554     6.77%
Firewall 1          365      709    31,951           1         617    12.35%
Firewall 2          325      590    36,428           6         590    19.00%
Apj               2,044    1,164     6,841           1          58     0.29%
Americas large    3,485   10,127   185,294           1         733     0.53%
Americas small    3,477    1,587   105,205           1         310     1.91%
Customer         10,021      277    45,427           0          25     1.64%

Figure 1: Real-world datasets



heuristics on several input test cases and report some comparisons of
their performance with previous proposals. We compare our heuristics
with the one described in [13] (from now on denoted CRM). As far as we
know, [13] is the only paper to have considered the problem of
constructing a role set under cardinality constraints on the roles. In
this sense, this is the first comparison between two heuristics
dealing with cardinality constraints, since in [13] much of the
discussion of the experimental results focused on comparisons with
other heuristics having no limitations on the size of the roles.

The comparison takes into account the metrics introduced in [17]. The
goal is to validate our proposal by showing that its performance,
regarding both the execution speed and the quality of the returned
role set, is equivalent to or better than that of CRM. We would like
to point out that, using our implementation of CRM, in some cases we
obtained values different from the ones presented in [13]. This could
be due to different choices in the two implementations (for instance,
in our implementation ties are broken at random, while it is not clear
how they are handled in [13]). Moreover, we had to resolve some
ambiguities we found in the description of Algorithm 2 in [13].

All heuristics have been implemented using Scilab [21] Version 5.3.0
on a MacBook Pro running Mac OS X 10.6.7 (2.66 GHz Intel Core i7, 4 GB
1067 MHz DDR3 SDRAM). In the next section, we compare our heuristics
with CRM over available real-world datasets, while in Section 5.2 we
present the results obtained by running the implementation of the
heuristics over synthetically generated datasets.

5.1 Real-world datasets

In this section we compare the CRM heuristic with our t-SMA_R and
t-SMA_C heuristics on the real-world datasets described in Figure 1.
These real-world datasets are available online at HP Labs [20] and
have been used for the evaluation of several other role mining
heuristics [m [17] [13]. The datasets americas small and americas
large have been obtained from Cisco






                                        Parameters
Dataset      Heuristic    NR   |RH|   |URA|   |RPA|    S1    S2   CPU time
Healthcare   t-SMA_R      16     25     352     429   806   822     0.0107
             t-SMA_C      14     23     317     354   694   708     0.0263
             CRM          14      0     317      53   370   384     0.0940
Domino       t-SMA_R      20     30     142     627   799   819     0.0176
             t-SMA_C      22     42     186     628   856   878     0.0720
             CRM          20      0     177     564   741   761     0.1604

Figure 2: Results of the three heuristics over the real-world datasets



firewalls granting access to the HP network to authenticated users
(users' access depends on their profiles). Similar datasets are apj
and emea. The healthcare dataset was obtained from the US Veteran's
Administration; the domino data was from a Lotus Domino server;
customer is based on the access control graph obtained from the IT
department of an HP customer. Finally, the firewall1 and firewall2
datasets were obtained as the result of executing an analysis
algorithm on Checkpoint firewalls. The main characteristics of the
nine datasets are reported in Figure 1, where we list the number of
users and permissions (second and third columns, respectively), the
overall number of permissions assigned to users (i.e., |UPA|, fourth
column), the minimum and maximum number of permissions assigned to a
user (fifth and sixth columns, respectively), and the density of UPA
(i.e., the ratio between |UPA| and the size of UPA, that is,
#Users x #Perms).
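As a quick arithmetic check of the density column, the ratio can be recomputed from the other columns of Figure 1. The helper name below is hypothetical; the input figures are the ones listed in Figure 1.

```python
def density_percent(upa_size, users, perms):
    """Density of UPA as a percentage: |UPA| / (#Users * #Perms) * 100."""
    return 100.0 * upa_size / (users * perms)


# (|UPA|, #Users, #Perms) taken from Figure 1.
rows = {
    "Healthcare": (1486, 46, 46),
    "Domino": (730, 79, 231),
    "Emea": (7220, 35, 3046),
    "Firewall 1": (31951, 365, 709),
}
for name, (size, users, perms) in rows.items():
    print(f"{name}: {density_percent(size, users, perms):.2f}%")
```

For Healthcare, for instance, 1,486 / (46 x 46) = 1,486 / 2,116, which rounds to 70.23%, in agreement with the table.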

The considered metrics are: the number of roles (NR), the size of the
role hierarchy (|RH|), the size of the user-to-role assignment matrix
(|URA|), the size of the role-to-permission assignment matrix (|RPA|),
the sum |URA| + |RPA| + |RH|, denoted by S1, the sum NR + |URA| +
|RPA| + |RH|, denoted by S2, and the execution time expressed in
seconds. The execution time is CPU time, which is not equivalent to
wall-clock time; we used it to compare CPU usage among the different
heuristics because it is independent of background processes that
might slow down the execution.
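The two aggregate metrics are simple sums, and the rows of Figure 2 can be used to check them. The sketch below uses hypothetical function names and reads the fourth size column of Figure 2 as |RPA|, which is the reading under which the reported S1 and S2 values add up.

```python
def s1(ura, rpa, rh):
    """S1 = |URA| + |RPA| + |RH|."""
    return ura + rpa + rh


def s2(nr, ura, rpa, rh):
    """S2 = NR + |URA| + |RPA| + |RH|, i.e. S1 plus the number of roles."""
    return nr + s1(ura, rpa, rh)


# Healthcare, t-SMA_R row of Figure 2: NR = 16, |RH| = 25, |URA| = 352, |RPA| = 429.
print(s1(352, 429, 25), s2(16, 352, 429, 25))

# Healthcare, CRM row of Figure 2: NR = 14, |RH| = 0, |URA| = 317, |RPA| = 53.
print(s1(317, 53, 0), s2(14, 317, 53, 0))
```

Since S2 differs from S1 only by NR, the two metrics rank the heuristics differently only when the number of roles is large relative to the assignment sizes.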

We first tested the heuristics when there is no constraint on the role
size (i.e., we set $t$ equal to max#Perms). In this case, in Algorithm
1, setting the parameter selection either to 0 or to 1 has no effect
on the returned candidate role-set. The results we obtained by running
the three heuristics are listed in Figure 2, where both heuristics
t-SMA_R-0 and t-SMA_R-1 are denoted by t-SMA_R. Both our heuristics
behave well on the nine datasets. Considering the size of the
candidate role-set generated by the heuristics, in four cases out of
nine (i.e., Healthcare, Domino, Emea, and Firewall 2) our heuristics
provide the same results as CRM. In four cases out of nine (i.e.,
Firewall 1, Apj, Americas small, and Americas large) CRM returns a
slightly smaller role-set. Finally, for the Customer dataset t-SMA_C
returns the smallest role-set. Considering the CPU time, our
heuristics outperform CRM with improvements ranging from 50% to 90%.
If we look at parameters S1 and S2 we see that, except for the Emea
dataset, CRM has a better performance.
To improve the results we could reduce the size of the role hierarchy
RH by running the lattice-based postprocessing procedure defined in
[6]. According to this procedure, each role $r \in \mathcal{R}$
containing some other roles is substituted with the role $r'$, where

$$r' = r \setminus \bigcup_{r_c \,:\, r_c \subset r} r_c.$$

If $r'$ is empty, then the number of roles is reduced. The
substitutions continue until the lattice is completely flat (i.e., no
role contains any other role). After running this procedure, the role
hierarchy $\mathcal{RH}$ will be empty, implying |RH| = 0. The results
obtained by running the lattice-based postprocessing procedure are
shown in Figure 3. Notice that the above mentioned procedure has not
been applied when in Figure 2 we have |RH| = 0. As one can see, CRM
never computes a smaller role-set than the one returned by our
heuristics. Moreover, considering the parameters S1 and S2, our
heuristics in three cases out of nine (i.e., Healthcare, Firewall 2,
and
Figure 3: Results of the lattice-based postprocessing procedure



Customer) achieve the same results as CRM; while, in the 
remaining six cases they return better results. Sometime, 
the CPU time needed by our heuristics is larger than the 
one for CRM but still comparable. 
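
The lattice-based postprocessing step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation from [6]: the function name `flatten_roles` is hypothetical, roles are modeled as plain sets of permissions, and we assume that all roles are rewritten simultaneously in each pass (the text does not fix the order of the substitutions).

```python
def flatten_roles(roles):
    """Flatten a role-set so that no role strictly contains another.

    Hypothetical helper: each role is an iterable of permissions.
    Assumption: all roles are rewritten simultaneously in each pass.
    """
    current = {frozenset(r) for r in roles if r}
    while True:
        updated = set()
        changed = False
        for r in current:
            # Roles strictly contained in r (the r_c with r_c ⊂ r).
            contained = [c for c in current if c < r]
            if not contained:
                updated.add(r)
                continue
            changed = True
            # r' = r \ (union of the contained roles)
            r_new = r - frozenset().union(*contained)
            if r_new:  # an empty r' is dropped, shrinking the role-set
                updated.add(r_new)
        if not changed:
            return updated
        current = updated


flat = flatten_roles([{1, 2, 3}, {1, 2}, {4}])
shrunk = flatten_roles([{1, 2}, {1}, {2}])
```

In the second call the role {1, 2} is fully covered by the roles it contains, so it disappears and the role-set shrinks. The sketch only rewrites the roles; the user-to-role assignment would also have to be updated so that every user holding r receives r' and the roles previously contained in r.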
Figure 4: Generated roles for Americas small

We have also evaluated our heuristics by varying the threshold
constraint. We performed tests on all the nine datasets,
but due to space limitations we only report the results for the
Americas small dataset with t ∈ {22 + 12 · i | 0 ≤ i ≤ 24}.
The reported results (see Figures 4-6) do not include the
application of the lattice-based postprocessing procedure. In
general, there are few differences between the behavior of
t-SMA_R-0 and t-SMA_R-1. Indeed, the plots associated
with them almost overlap. As expected, the number of roles
increases (see Figure 4) when the constraint value decreases,
i.e., when fewer permissions can be assigned to each role. Our
heuristics always return a smaller role-set than the one
computed by CRM. According to Figure 5, the value of the
constraint t does not much affect the computation time of our
heuristics (the same holds for CRM unless t < 46). In any case,
CRM's computation time is 2.5 times larger than that
of t-SMA_C and about 20 times larger than that of t-SMA_R-0
(t-SMA_R-1).

Figure 5: Running time for Americas small

Considering the parameter S2, according to Figure 6, we see
that for our heuristics, as the value of t decreases, the
parameter S2 decreases as well. For heuristic CRM, the value of
the parameter S2 does not change much unless t < 46. For
166 ≤ t ≤ 310, CRM generates solutions with a smaller value
of S2 than our heuristics, while for 22 ≤ t < 166
our heuristics perform better.

5.2 Synthetic datasets 

In this section, we report the performance evaluation on
synthetic datasets of our heuristics compared to CRM. We
followed the approach suggested in [25], generating the datasets
by a synthetic data generator. Such a generator takes as
input five parameters: the number of roles NR, the number
of users NU, the number of permissions NP, the maximal
number of roles MRU that can be assigned to each user,
and the maximal number of permissions MPR each role
can have. To generate the role-to-permission assignment,
for each role r, the generator chooses a random number
N_r ∈ [1, MPR], then it randomly chooses N_r permissions
from P and assigns them to r. In this way, we construct
the RPA matrix. To obtain the URA matrix, the generator,
for each user u, chooses a random number M_r ∈ [1, MRU],
then it randomly chooses M_r roles from the generated ones
and assigns them to u. Then, the UPA matrix is implicitly
defined.

Figure 6: Value of S2 for Americas small
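
The generator described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator of [25]: the function name `generate_dataset` and the `seed` argument are hypothetical, roles and permissions are represented by integer indices, and the matrices are stored as lists of sets rather than as 0/1 matrices.

```python
import random


def generate_dataset(nr, nu, n_perm, mru, mpr, seed=None):
    """Return (rpa, ura, upa) as lists of frozensets.

    nr, nu, n_perm, mru, mpr correspond to NR, NU, NP, MRU, MPR.
    Requires mpr <= n_perm and mru <= nr, otherwise sampling fails.
    """
    rng = random.Random(seed)

    # Role-to-permission assignment (RPA): N_r permissions per role.
    rpa = []
    for _ in range(nr):
        n_r = rng.randint(1, mpr)  # uniform in [1, MPR]
        rpa.append(frozenset(rng.sample(range(n_perm), n_r)))

    # User-to-role assignment (URA): M_r roles per user.
    ura = []
    for _ in range(nu):
        m_r = rng.randint(1, mru)  # uniform in [1, MRU]
        ura.append(frozenset(rng.sample(range(nr), m_r)))

    # UPA is implied: a user holds every permission of its roles.
    upa = [frozenset().union(*(rpa[r] for r in roles)) for roles in ura]
    return rpa, ura, upa


rpa, ura, upa = generate_dataset(nr=10, nu=20, n_perm=100,
                                 mru=3, mpr=10, seed=1)
```

The call at the end uses NP/MPR = 10, as in the fixed-ratio experiments; the remaining parameter values are arbitrary placeholders, not the ones used in the experiments.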
Figure 7: Test parameters with fixed NP/MPR ratio

We generated datasets using the parameters summarized in
Figure 7. As the synthetic data generator is randomized,
for each set of parameters we ran it ten times. On each
randomly generated dataset (i.e., for each UPA matrix we
created) both our heuristics and CRM were run. For a
specific parameter set, all reported results are averaged over
the ten runs.

Figure 8: Role-set size for fixed NP/MPR ratio

In all our experiments we set the value of the cardinality
constraint equal to the maximum number of permissions that
can be assigned to each role (i.e., t = MPR). We tested the
heuristics on several different datasets obtained by keeping
some parameters constant while others ranged over different
values. For the sake of brevity, we report here only the
results of the experiments on the test parameters in Figure 7,
where the maximum number of permissions per role ranges
from 10 to 200 (and hence so does the value of the constraint
parameter t), while the ratio NP/MPR is constant
and equal to 10.

Figure 9: CPU time for fixed NP/MPR ratio

For each set of parameters we report the size of the complete
role-set generated by running the heuristics (see Figure 8) as
well as the CPU time needed to compute the complete role-set
(see Figure 9). We also consider Accuracy and Distance.

Figure 10: Accuracy for fixed NP/MPR ratio

Figure 11: Distance for fixed NP/MPR ratio



Accuracy (see Figure 10) is defined as the ratio between
the number of generated roles exactly matching the original
roles and the size of the role-set generated by the synthetic
data generator (i.e., we measure the percentage of
original roles found by the heuristics). Given a complete
role-set generated by any of the heuristics, the Distance
parameter (see Figure 11) measures how different the role-set
generated by the heuristic is from the original one (i.e., if
R_G is the role-set generated by the synthetic data generator
and R_F is the role-set computed by the heuristic, then
Distance = |R_F \ R_G|). According to the data summarized in
Figures 8-11, the results returned by t-SMA_R-0 and t-SMA_R-1
are identical and the plots associated with them overlap.
Both heuristics are much better than CRM. Indeed, they
compute in a few seconds a role-set having 100% Accuracy
and null Distance, meaning that the role-set generated by the
synthetic data generator is completely reconstructed for all
the considered test cases. Notice that CRM's Accuracy is
less than 20%, and its Distance and CPU time increase as the
maximum number of permissions per role does. Our heuristics
t-SMA_R-0 and t-SMA_R-1 always return a role-set
containing about 100 roles, while CRM returns a much larger
role-set (i.e., 4 to 15 times larger).

Figure 12: Results of the four heuristics over synthetic datasets
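
The two metrics can be made concrete with a short Python sketch. The function names `accuracy` and `distance` are hypothetical, roles are modeled as sets of permissions, and we assume that duplicate roles in either role-set are counted once; Distance is read here as the plain set difference between the computed and the original role-set.

```python
def _as_role_set(roles):
    # Normalise an iterable of roles into a set of frozensets.
    return {frozenset(r) for r in roles}


def accuracy(original_roles, found_roles):
    """Fraction of the original roles that the heuristic found exactly."""
    original = _as_role_set(original_roles)
    found = _as_role_set(found_roles)
    return len(original & found) / len(original)


def distance(original_roles, found_roles):
    """Number of computed roles that are not original roles."""
    original = _as_role_set(original_roles)
    found = _as_role_set(found_roles)
    return len(found - original)


original = [{1, 2}, {3}, {4, 5}]
found = [{1, 2}, {3, 4}, {5}, {3}]
acc = accuracy(original, found)   # two of three original roles recovered
dist = distance(original, found)  # {3, 4} and {5} are not original roles
```

A perfect reconstruction, as reported for t-SMA_R-0 and t-SMA_R-1, corresponds to `accuracy(...) == 1.0` and `distance(...) == 0`.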

We also evaluated all heuristics using the metrics considered
in Section 5.1. Results are summarized in Figures 12 and 13.
We can see that, although in some cases CRM returns a good
value for |UPA|, in general its performance is not very good.
We also applied the lattice-based postprocessing procedure
to the role-sets obtained by running the four heuristics. In
Figure 13 we report the results. The procedure improves
some of the parameters, flattening, as expected, the role
hierarchy and increasing the total CPU time. Even with
the postprocessing, the results achieved by CRM are not
very accurate. Only for the datasets constructed using the
first test parameters (i.e., when MPR = 10) does CRM return
the original number of roles (with a low Accuracy, anyway),
while in the other test cases it remains far from the expected
number. Postprocessing has little influence on t-SMA_C, since
the number of roles does not change; only the UPA
values improve and, to the same extent, S1 and S2.

6. CONCLUSIONS 

The role mining process usually returns a role infrastructure
on the basis of the relationships among users and
permissions contained in the UPA matrix. However, the definition
of a role-set really reflecting the internal functionalities
of the examined organization remains a challenging task.
The need for managing different kinds of constraints in role
engineering has recently been the focus of many works in the
literature [13], [10], [IB]. The definition and the management
of constraints are very important aspects of
role mining, since they allow the role engineer to control
the automatic process and introduce some rules that can
have an impact on the retrieved structure. In this paper, we
have proposed a heuristic capable of returning a complete
role-set satisfying constraints on the maximum number of
permissions included in each role. The comparisons made
show how the results in terms of accuracy, distance, size,
and computation time improve on a previously presented
algorithm [13]. Our simple algorithm is easily extensible to
consider other kinds of cardinality constraints, such as the
maximum number of users assigned to a role, or mutually
exclusive permissions or roles [15]. Furthermore, it is possible
to investigate the definition of other kinds of constraints
regarding the role hierarchy and the semantics associated with
each role, and to try to adapt the proposed algorithm in
order to return a role-set satisfying the newly defined
constraints.

Figure 13: Results of the four heuristics over synthetic datasets after lattice-based postprocessing